1. Introduction

This project aims to help Zillow better predict its housing market predictions. The project emphasizes on local knowledge, a novel way of looking at the data and creative factors that might enhance predictability of Zillow’s modeling. Currently, Zillow’s model does fine, but this approach offers a different lens into how the model could be built stronger with intel from a variety of internal and external factors. While Zillow may be able to predict housing prices by just looking at sales, taxes and a few other economic factors, this project offers a new model that looks at the city of Philadelphia from the forces that dictate the quotidien, such as amenities like schools, parks, public spaces, as well as social characteristics like poverty rates, gathering spaces and markets, etc.

This project is best understood as an exploratory analysis of how the data of Philadelphia and its unique characteristics can influence a model. This project, first and foremost, offers a fresh perspective into why and how native factors better predictability.

This exercise amplifies every data scientist’s apprehension; no matter how objective one tries to be, one always brings personal biases in selecting testing factors. In this specific case, we as urban data scientists, believe that some of the most important factors in consideration should be those that affect quotidian urban lives. This includes a mix of demographics, spatial factors and physical features, and social characteristics of Philly. This makes being objective, and hence the project a difficult exercise. However, this model tries to experiment with possibilities and strive to only perfect the model to predict the best way we can whilst making sure the character of the city is embedded in the model.

The result of our analysis is a house value model based on our hypothesis of the impact of various internal and external factors as well as spatial clustering. The generalizability and accuracy of the model can further be explained through tests and measures of error/error percentage, which can be further explored below.

2. Set-up

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
library(sf)
library(data.table)
library(spdep)
library(caret)
library(dplyr)
library(ckanr)
library(MASS)
library(FNN)
library(viridis)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(jtools)     # for regression model plots
library(broom)
library(tufte)
library(ggplot2)
library(rmarkdown)
library(stargazer)
library(splm)
library(kableExtra)
library(corrplot)
library(tidycensus)
library(FNN)
library(psych)
# functions and data directory
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"

source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

# Colors
bluePalette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
blue2Palette5 <- c("#08519c","#3182bd","#6baed6","#bdd7e7","#eff3ff")
orangePalette5 <- c("#ffffb2","#fecc5c","#fd8d3c","#f03b20","#bd0026")
greyPalette5 <- c("#f7f7f7","#cccccc","#969696","#636363","#252525")

# LoadAPI
census_api_key("52c094b268b1f62d3845601acbd2ae784e75ccc9", overwrite = TRUE)

3. DATA

3.1 Data Wrangling

For this project, we used data from Zillow for which the unit of analysis was at house unit level. This dataset included important factors about each unit such as number of bedrooms, bathrooms, garage space, floor area, etc. The model also borrowed data from OpenDataPhilly around crime, trees, Heat Index, Farmers Markets, etc. Finally, it collected data from the Census Bureau around education, poverty, race and population.

# sdknn <- sd %>% dplyr::select(objectid, geometry, sale_price)
# 
# # finding counts by group
# crime_grouped <- crime %>% 
# group_by(text_general_code) %>%
#   summarize(count = n()) %>%
#   arrange(-count) %>%
#   kable() %>%
#   kable_styling()
# 
# crime_grouped
# 
# unique(crime$text_general_code)
# crimepts <- crime %>%
#   filter(text_general_code %in% c("Aggravated Assault No Firearm", "Thefts", "Other Assaults", "Theft from Vehicle", "Aggravated Assault Firearm", "Burglary Residential", "Robbery No Firearm", "Robbery Firearm", "Rape", "Homicide - Criminal"),
#          lat > -1) %>%
#   dplyr::select(lat, lng) %>%
#   na.omit() %>%
#   st_as_sf(coords = c("lng", "lat"), crs = "EPSG:4326") %>%
#   st_transform('ESRI:102286') %>%
#   distinct()
# 
# sdknn$crimebuff <- sdknn %>% 
#     st_buffer(660) %>% 
#     aggregate(mutate(crimepts, counter = 1),., sum) %>%
#     pull(counter)
# 
# sdknn$crimebuff[is.na(sdknn$crimebuff)] <- 0
# 
# ## Nearest Neighbor Feature
# 
# coord<- st_coordinates(sdknn)
# crimecoord <- st_coordinates(crimepts)
# 
# sdknn <-
#   sdknn %>% 
#     mutate(
#       crime_nn1 = nn_function(coord, 
#                               crimecoord, k = 1),
#       
#       crime_nn2 = nn_function(coord, 
#                               crimecoord, k = 2), 
#       
#       crime_nn3 = nn_function(coord, 
#                               crimecoord, k = 3), 
#       
#       crime_nn4 = nn_function(coord, 
#                               crimecoord, k = 4), 
#       
#       crime_nn5 = nn_function(coord, 
#                               crimecoord, k = 5)) 
# 
# med <- med %>%
#   cbind(st_coordinates(med$geometry))
# 
# medpts <- med %>%
#   dplyr::select(X, Y) %>%
#   na.omit() %>%
#   distinct()
# 
# sdknn$medbuff <- sd %>% 
#     st_buffer(1320) %>% 
#     aggregate(mutate(medpts, counter = 1),., sum) %>%
#     pull(counter)
# 
# sdknn$medbuff[is.na(sdknn$medbuff)] <- 0
# medcoord <- st_coordinates(medpts)
# 
# ## Nearest Neighbor Feature
# 
# sdknn <-
#   sdknn %>% 
#     mutate(
#       med_nn1 = nn_function(coord, 
#                               medcoord, k = 1),
#       
#       med_nn2 = nn_function(coord, 
#                               medcoord, k = 2), 
#       
#       med_nn3 = nn_function(coord, 
#                               medcoord, k = 3), 
#       
#       med_nn4 = nn_function(coord, 
#                               medcoord, k = 4), 
#       
#       med_nn5 = nn_function(coord, 
#                               medcoord, k = 5)) 
# 
# 
# heat <- heat %>%
#   dplyr::select(geometry, HVI_SCORE)
# 
# sdknn <- st_join(sdknn, heat, join = st_intersects)
# 
# school <- school %>%
#   cbind(st_coordinates(school$geometry))
# 
# schoolpts <- school %>%
#   dplyr::select(X, Y) %>%
#   na.omit() %>%
#   distinct()
# 
# sdknn$schoolbuff <- sd %>% 
#     st_buffer(660) %>% 
#     aggregate(mutate(schoolpts, counter = 1),., sum) %>%
#     pull(counter)
# 
# sdknn$schoolbuff[is.na(sdknn$schoolbuff)] <- 0
# schoolcoord <- st_coordinates(schoolpts)
# 
# ## Nearest Neighbor Feature
# 
# sdknn <-
#   sdknn %>% 
#     mutate(
#       school_nn1 = nn_function(coord, 
#                               schoolcoord, k = 1),
#       
#       school_nn2 = nn_function(coord, 
#                               schoolcoord, k = 2), 
#       
#       school_nn3 = nn_function(coord, 
#                               schoolcoord, k = 3), 
#       
#       school_nn4 = nn_function(coord, 
#                               schoolcoord, k = 4), 
#       
#       school_nn5 = nn_function(coord, 
#                               schoolcoord, k = 5)) 
# 
# play <- play %>%
#   cbind(st_coordinates(play$geometry))
# 
# playpts <- play %>%
#   dplyr::select(geometry) %>%
#   na.omit() %>%
# 
#   distinct()
# 
# sdknn$playbuff <- sdknn %>% 
#     st_buffer(660) %>% 
#     aggregate(mutate(playpts, counter = 1),., sum) %>%
#     pull(counter)
# 
# sdknn$playbuff[is.na(sdknn$playbuff)] <- 0
# playcoord <- st_coordinates(playpts)
# 
# ## Nearest Neighbor Feature
# 
# sdknn <-
#   sdknn %>% 
#     mutate(
#       play_nn1 = nn_function(coord, 
#                               playcoord, k = 1),
#       
#       play_nn2 = nn_function(coord, 
#                               playcoord, k = 2), 
#       
#       play_nn3 = nn_function(coord, 
#                               playcoord, k = 3), 
#       
#       play_nn4 = nn_function(coord, 
#                               playcoord, k = 4), 
#       
#       play_nn5 = nn_function(coord, 
#                               playcoord, k = 5)) 
# 
# 
# grocer$pointgeometry <- st_centroid(grocer$geometry)
# 
# grocer <- grocer %>%
#   cbind(st_coordinates(grocer$pointgeometry))
# 
# grocerpts <- grocer %>%
#   dplyr::select(X, Y) %>%  # Assuming X corresponds to longitude and Y corresponds to latitude
#   na.omit() %>%
#   st_as_sf(coords = c("X", "Y"), crs = "EPSG:4326") %>%
#   st_transform('ESRI:102286') %>%
#   distinct()
# 
# sdknn$grocerbuff <- sdknn %>% 
#     st_buffer(660) %>% 
#     aggregate(mutate(grocerpts, counter = 1),., sum) %>%
#     pull(counter)
# 
# sdknn$grocerbuff[is.na(sdknn$grocerbuff)] <- 0
# grocercoord <- st_coordinates(grocerpts)
# 
# ## Nearest Neighbor Feature
# 
# #sdknn <-
# #  sdknn %>% 
#  #   mutate(
#  #     grocer_nn1 = nn_function(coord, 
#  #                             grocercoord, k = 1),
#       
# #      grocer_nn2 = nn_function(coord, 
#  #                             grocercoord, k = 2), 
#       
#  #     grocer_nn3 = nn_function(coord, 
#   #                            grocercoord, k = 3), 
#       
#   #    grocer_nn4 = nn_function(coord, 
#    #                           grocercoord, k = 4), 
#       
#    #   grocer_nn5 = nn_function(coord, 
#   #                            grocercoord, k = 5)) 
# 
# 
# fmarket <- fmarket %>%
#   cbind(st_coordinates(fmarket$geometry))
# 
# # Assuming fmarket is your geodataframe
# fmarketpts <- fmarket %>%
#   dplyr::select(X, Y) %>%  # Assuming X corresponds to longitude and Y corresponds to latitude
#   na.omit() %>%
# #  st_as_sf(coords = c("X", "Y"), crs = "EPSG:4326") %>%
#  # st_transform('ESRI:102286') %>%
#   distinct()
# 
# 
# sdknn$fmarketbuff <- sdknn %>% 
#     st_buffer(1980) %>% 
#     aggregate(mutate(fmarketpts, counter = 1),., sum) %>%
#     pull(counter)
# 
# sdknn$fmarketbuff[is.na(sdknn$fmarketbuff)] <- 0
# fmarketcoord <- st_coordinates(fmarket)
# 
# ## Nearest Neighbor Feature
# 
# sdknn <-
#   sdknn %>% 
#     mutate(
#       fmarket_nn1 = nn_function(coord, 
#                               fmarketcoord, k = 1),
#       
#       fmarket_nn2 = nn_function(coord, 
#                               fmarketcoord, k = 2), 
#       
#       fmarket_nn3 = nn_function(coord, 
#                               fmarketcoord, k = 3), 
#       
#       fmarket_nn4 = nn_function(coord, 
#                               fmarketcoord, k = 4), 
#       
#       fmarket_nn5 = nn_function(coord, 
#                               fmarketcoord, k = 5)) 
# 
# 
# 
# treepts <- tree %>%
#   dplyr::select(LOC_X, LOC_Y) %>%
#   na.omit() %>%
#   st_as_sf(coords = c("LOC_Y", "LOC_X"), crs = "EPSG:4326") %>%
#   st_transform('ESRI:102286') %>%
#   distinct()
# 
# sdknn$treebuff <- sdknn %>% 
#     st_buffer(660) %>% 
#     aggregate(mutate(treepts, counter = 1),., sum) %>%
#     pull(counter)
# 
# sdknn$treebuff[is.na(sdknn$treebuff)] <- 0
# treecoord <- st_coordinates(treepts)
# 
# ## Nearest Neighbor Feature
# 
# sdknn <-
#   sdknn %>% 
#     mutate(
#       tree_nn1 = nn_function(coord, 
#                               treecoord, k = 1),
#       
#       tree_nn2 = nn_function(coord, 
#                               treecoord), k = 2), 
#       
#       tree_nn3 = nn_function(coord, 
#                               treecoord, k = 3), 
#       
#       tree_nn4 = nn_function(coord, 
#                               treecoord, k = 4), 
#       
#       tree_nn5 = nn_function(coord), 
#                               treecoord), k = 5)) 

3.2 Summary Tables

We first will describe here what factors could into the project. Through a series of tests, we then will determine which ones of these will be best suited and generizable. The following three tables look at the summary statistics of internal characteristics, nearby amenities, and demographic data.

numericVars <- 
 select_if(st_drop_geometry(sd), is.numeric) %>% na.omit()

stargazer(numericVars, type = "text", title = "Summary Statistics by Internal Characteristicss", 
                     summary = TRUE, out = "summary_statistics_internal_chars.txt", digits = 2)
## 
## Summary Statistics by Internal Characteristicss
## ===================================================================================
## Statistic            N        Mean          St. Dev.         Min           Max     
## -----------------------------------------------------------------------------------
## objectid            429  360,324,401.00    12,349.73     360,302,736   360,345,502 
## category_code       429       1.00            0.00            1             1      
## census_tract        429      129.51          109.76          15            388     
## depth               429      76.32           39.22            0            337     
## exempt_land         429      57.76           848.89           0          13,580    
## exterior_condition  429       3.05            1.25            1             7      
## fireplaces          429       0.13            0.49            0             5      
## frontage            429      22.54           27.28            0            388     
## garage_spaces       429       0.24            0.53            0             3      
## geographic_ward     429      30.53           14.02            1            65      
## house_number        429     2,335.85        2,234.71          3           9,218    
## interior_condition  429       2.64            1.18            1             7      
## number_of_bathrooms 429       1.77            0.76            1             4      
## number_of_bedrooms  429       3.14            0.83            1             8      
## number_of_rooms     429       6.53            1.63            4            15      
## number_stories      429       2.29            0.52            1             3      
## off_street_open     429     1,892.79        1,385.27         118          8,330    
## parcel_number       429  312,587,766.00  148,309,143.00  11,020,200    888,210,597 
## sale_price          429    391,861.80      304,197.40         0         3,150,000  
## street_code         429    54,943.27       23,246.32       11,160        89,390    
## total_area          429     2,238.17        4,881.59          0          57,987    
## total_livable_area  429     1,565.13         752.21          784          6,562    
## year_built          429     1,935.16         31.35          1,856         2,023    
## zip_code            429    19,138.19         11.07         19,114        19,153    
## pin                 429 1,001,366,978.00   169,449.80   1,001,051,076 1,001,682,780
## sale_year           429     2,022.26          0.44          2,022         2,023    
## musaID              429    12,452.80        6,976.42         93          23,693    
## TotalPop            429     4,190.82        1,510.95        1,027         9,047    
## MedHHInc            429    66,712.56       23,595.44       14,798        144,531   
## MedRent             429     1,086.22         258.56          441          2,339    
## pctWhite            429       0.43            0.29         0.0003         1.12     
## pctBachelors        429       0.02            0.02          0.00          0.12     
## pctPoverty          429       0.17            0.11          0.01          0.53     
## -----------------------------------------------------------------------------------
#filter to numeric values only
numericVars_sdknn <- 
 select_if(st_drop_geometry(sdknn), is.numeric) %>% na.omit()
#table
stargazer(numericVars_sdknn, type = "text", title = "Summary Statistics by Nearby Amenities", 
                     summary = TRUE, out = "summary_statistics_amenities.txt", digits = 2)
## 
## Summary Statistics by Nearby Amenities
## ====================================================================
## Statistic     N         Mean       St. Dev.      Min         Max    
## --------------------------------------------------------------------
## objectid    23,881 360,323,428.00 12,257.32  360,302,636 360,352,508
## sale_price  23,881   271,846.80   240,083.40      0       5,990,000 
## crimebuff   23,881     124.32       61.15         1          274    
## crime_nn1   23,881     55.66        33.19       7.63       611.50   
## crime_nn2   23,881     67.01        36.65       11.71      611.97   
## crime_nn3   23,881     78.34        40.18       16.34      798.31   
## crime_nn4   23,881     88.39        43.69       20.69      893.84   
## crime_nn5   23,881     97.56        47.06       24.76      952.43   
## medbuff     23,881      0.81         1.06         0           5     
## med_nn1     23,881    1,610.20     1,065.29     24.22     7,948.94  
## med_nn2     23,881    1,945.62     1,145.59     77.40     8,222.91  
## med_nn3     23,881    2,250.18     1,193.67    219.05     8,516.00  
## med_nn4     23,881    2,598.14     1,281.80    510.08     8,844.52  
## med_nn5     23,881    2,878.34     1,369.47    664.10     9,139.81  
## HVI_SCORE   23,881     -0.33         4.15       -8.35       8.17    
## schoolbuff  23,881      3.05         2.13         0          14     
## school_nn1  23,881     349.98       197.48      9.89      1,799.39  
## school_nn2  23,881     447.53       208.29      46.57     2,291.26  
## school_nn3  23,881     531.39       226.94      68.48     2,490.30  
## school_nn4  23,881     607.32       245.87     125.38     2,612.69  
## school_nn5  23,881     676.50       266.53     172.15     2,729.48  
## playbuff    23,881      7.90         9.07         0          45     
## play_nn1    23,881     788.40      1,148.03     8.93      6,502.55  
## play_nn2    23,881     971.65      1,422.61     11.65     8,388.83  
## play_nn3    23,881    1,067.08     1,554.41     11.65     9,261.06  
## play_nn4    23,881    1,135.92     1,634.49     16.73     9,727.01  
## play_nn5    23,881    1,191.68     1,687.84     43.40     10,049.93 
## grocerbuff  23,881     16.74         6.30         2          35     
## fmarketbuff 23,881      1.55         1.56         0           8     
## fmarket_nn1 23,881    1,758.53     1,510.20     23.82     9,311.52  
## fmarket_nn2 23,881    2,208.60     1,726.27    177.92     10,644.13 
## fmarket_nn3 23,881    2,622.36     1,868.10    240.85     11,683.51 
## fmarket_nn4 23,881    3,180.40     2,222.16    601.64     13,237.11 
## fmarket_nn5 23,881    3,604.04     2,466.98    844.08     14,291.91 
## treebuff    23,881     939.55       761.23       12         4,103   
## --------------------------------------------------------------------
# join sd and neighborhoods 
joined_nhoods_sd <- st_join(nhoods, sd, join = st_intersects)
#filter to numeric values only
numericVars_nhoods <- 
 select_if(st_drop_geometry(joined_nhoods_sd), is.numeric) %>% na.omit()
#table
stargazer(numericVars_nhoods, type = "text", title = "Summary Statistics by Demographics", 
                     summary = TRUE, out = "summary_statistics_demographics.txt", digits = 2)
## 
## Summary Statistics by Demographics
## ===================================================================================
## Statistic            N        Mean          St. Dev.         Min           Max     
## -----------------------------------------------------------------------------------
## TotalPop.x          429     4,190.82        1,510.95        1,027         9,047    
## MedHHInc.x          429    66,712.56       23,595.44       14,798        144,531   
## MedRent.x           429     1,086.22         258.56          441          2,339    
## pctWhite.x          429       0.43            0.29         0.0003         1.12     
## pctBachelors.x      429       0.02            0.02          0.00          0.12     
## pctPoverty.x        429       0.17            0.11          0.01          0.53     
## objectid            429  360,324,401.00    12,349.73     360,302,736   360,345,502 
## category_code       429       1.00            0.00            1             1      
## census_tract        429      129.51          109.76          15            388     
## depth               429      76.32           39.22            0            337     
## exempt_land         429      57.76           848.89           0          13,580    
## exterior_condition  429       3.05            1.25            1             7      
## fireplaces          429       0.13            0.49            0             5      
## frontage            429      22.54           27.28            0            388     
## garage_spaces       429       0.24            0.53            0             3      
## geographic_ward     429      30.53           14.02            1            65      
## house_number        429     2,335.85        2,234.71          3           9,218    
## interior_condition  429       2.64            1.18            1             7      
## number_of_bathrooms 429       1.77            0.76            1             4      
## number_of_bedrooms  429       3.14            0.83            1             8      
## number_of_rooms     429       6.53            1.63            4            15      
## number_stories      429       2.29            0.52            1             3      
## off_street_open     429     1,892.79        1,385.27         118          8,330    
## parcel_number       429  312,587,766.00  148,309,143.00  11,020,200    888,210,597 
## sale_price          429    391,861.80      304,197.40         0         3,150,000  
## street_code         429    54,943.27       23,246.32       11,160        89,390    
## total_area          429     2,238.17        4,881.59          0          57,987    
## total_livable_area  429     1,565.13         752.21          784          6,562    
## year_built          429     1,935.16         31.35          1,856         2,023    
## zip_code            429    19,138.19         11.07         19,114        19,153    
## pin                 429 1,001,366,978.00   169,449.80   1,001,051,076 1,001,682,780
## sale_year           429     2,022.26          0.44          2,022         2,023    
## musaID              429    12,452.80        6,976.42         93          23,693    
## TotalPop.y          429     4,190.82        1,510.95        1,027         9,047    
## MedHHInc.y          429    66,712.56       23,595.44       14,798        144,531   
## MedRent.y           429     1,086.22         258.56          441          2,339    
## pctWhite.y          429       0.43            0.29         0.0003         1.12     
## pctBachelors.y      429       0.02            0.02          0.00          0.12     
## pctPoverty.y        429       0.17            0.11          0.01          0.53     
## -----------------------------------------------------------------------------------

3.3 Correlation Matrix

This section evaluates which of the factors that we chose are highly related to one another, and hence will not prove to be useful for the model. In the matrix below, we can see that factors like street code and pincode are correlated, in addition to number of bed rooms and rooms, ward and parcel number, and lastly frontage and total area. In demographic factors, percent poverty and median rent and median household income seem to be correlated. For this reason, we will consider the following factors and leave out the rest: Exterior and interiors conditions, garage space, number of bedrooms, number of bathrooms, number of stories, fireplace, tyoe of view, year of housing built, and total built area.

ggcorrplot(
 round(cor(numericVars_nhoods), 1), 
 p.mat = cor_pmat(numericVars_nhoods),
 colors = c("blue", "white", "red"),
 type="lower",
 insig = "blank",
 tl.cex = 6) +  
 labs(title = "Correlation across Different Variables") +
 mapTheme()

3.4 Home Price Correlation Matrix Scatterplots

Plot 1: Total area versus Sales Price As the total area of the unit increases, so does the price. This may indicate that places with larger houses, typically away from dense settlements might have higher sales prices.

Plot 2: Year built versus sales price Newer construction is definitely more valued. Houses built post 1950s bring values of more than a million dollars. However, there seem to be many outliers that make it difficult to generalize.

Plot 3: Number of bedrooms versus sales price

Looks like sales price increases as the number of bedrooms increases. This might be because of simply more space means more prices but also because after a certain number of bedrooms, prices increases indepently too.

Plot 4: Trees in buffer versus sales price This plot definitely shows a high slope, or that sales prices go further up in neighborhoods or tracts with high number of trees.

css1 <- ggplot(numericVars, aes(x = sale_price, y = total_area)) +
  geom_point(colour = "grey", size = 1) +
  geom_smooth(method = "lm", se = FALSE, colour = "blue", size = 1) +
  labs(subtitle = "Plot1: House Area",
       x = "Price",
       y = "House Area") +
  theme_minimal()


css2 <- ggplot(numericVars, aes(x = sale_price, y = year_built)) +
  geom_point(colour = "grey", size = 1) +
  geom_smooth(method = "lm", se = FALSE, colour = "blue", size = 1) +
  labs(subtitle = "Plot2: Year Built",
       x = "Price",
       y = "Year house was built") +
  theme_minimal()

#css2

css3 <- ggplot(numericVars, aes(x = sale_price, y = number_of_bedrooms)) +
  geom_point(colour = "grey", size = 1) +
  geom_smooth(method = "lm", se = FALSE, colour = "blue", size = 1) +
  labs(subtitle = "Plot3 : Number of Bedrooms",
       x = "Price",
       y = "Bedrooms") +
  theme_minimal()

#css3


css4 <- ggplot(numericVars_sdknn, aes(x = sale_price, y = treebuff)) +
  geom_point(colour = "grey", size = 1) +
  geom_smooth(method = "lm", se = FALSE, colour = "blue", size = 1) +
  labs(subtitle = "Plot4 : Trees in buffer",
       x = "Price",
       y = "Number of trees in a buffer") +
  theme_minimal()

#css4

grid.arrange(css1, css2, css3, css4)

3.5 Map of Home Price

The following map visualizes the home sale price dataset. The census tract geometry was chosen for this visualization to look at how nodes in Philadephia differ in mean home value. According to this map, the most expensive areas are in Center City, University City, and northwest Philadelphia.

# for age_house
sd <- sd %>%
  mutate(age = ifelse(year_built > 1000, 2021 - year_built, 0))

# grouping by census tract
sd_ct <- sd %>%
  group_by(census_tract) %>%
 summarise(
    sale_price = mean(sale_price, na.rm = TRUE),
    total_area = mean(total_area, na.rm = TRUE), 
    age = mean(age, na.rm = TRUE)
  )  %>%
  rename(NAME10 = census_tract)

sd_ct$NAME10 <- as.character(as.numeric(sd_ct$NAME10))

#joined_data <- left_join(sd_ct, ct, by = "NAME10")
joined_data <- st_join(nhoods, sd_ct, join = st_intersects)

joined_data <- joined_data[is.finite(joined_data$sale_price), ]

joined_data <- joined_data[is.finite(joined_data$total_area), ]

joined_data <- joined_data[is.finite(joined_data$age), ]

ggplot() +
  geom_sf(data = ct, color = "black", size = 0.3) +
  geom_sf(data = joined_data, aes(fill = sale_price), color = "black", size = 0.3) +
  scale_fill_gradient(low = "lightblue", high = "darkblue", guide = "legend",
                      name = "Sale Price", 
                      breaks = seq(min(joined_data$sale_price), max(joined_data$sale_price), length.out = 5)) +
  labs(title = "Home Sale Prices by Census Tract",
       subtitle = "Philadelphia",
       caption = "  ") +
  mapTheme()

3.6 Maps of Independent Variables

Map 1: Total area This map shows the size of houses by area. It seems like most houses in the city have about 100 square footage but the ones in the West of the city are super big.

Map 2: Age This map shows home age by census tracts and it seems like most houses are close to a 100 years old but newer construction has happened in the west and north parts of the city.

Map 3: Median Household Income This map shows median household income and it seems like the western and northern parts of the city have comparatively higher median household income. This will be useful in looking at observed home pricing and help predict home pricing as well.

Map 4: Median Rent This map shows median house rent, similar to the household income map, the western and northern parts of the city have higher rents.

Map 5: Percent White Population This map has percent white population. There is a higher percent white population as we move away from the city.

#livable area

map1 <- ggplot() +
  geom_sf(data = ct, color = "black", size = 0.3) +
  geom_sf(data = joined_data, aes(fill = total_area), color = "black", size = 0.3) +
  scale_fill_gradient(low = "lightblue", high = "darkblue", guide = "legend",
                      name = "Area", 
                      breaks = seq(min(joined_data$total_area), max(joined_data$total_area), length.out = 5)) +
  labs(title = "Home total area by Census Tract",
       subtitle = "Total areas for houses in Philly",
       caption = "") +
  mapTheme()

#age
map2 <- ggplot() +
  geom_sf(data = ct, color = "black", size = 0.3) +
  geom_sf(data = joined_data, aes(fill = age), color = "black", size = 0.3) +
  scale_fill_gradient(low = "lightblue", high = "darkblue", guide = "legend",
                      name = "Age", 
                      breaks = seq(min(joined_data$age), max(joined_data$age), length.out = 5)) +
  labs(title = "Home age by Census Tract",
       subtitle = "Age of housing in Philly",
       caption = "") +
  mapTheme()

#HHINC
map3 <- ggplot() +
  geom_sf(data = ct, color = "black", size = 0.3) +
  geom_sf(data = joined_data, aes(fill = MedHHInc), color = "black", size = 0.3) +
  scale_fill_gradient(low = "lightblue", high = "darkblue", guide = "legend",
                      name = "Median Income in Dollars", 
                      #breaks = seq(min(joined_data$MedHHInc), max(joined_data$MedHHInc), length.out = 5)
                      ) +
  labs(title = "Median Household Income by Census Tract",
       subtitle = "Household income in Philly",
       caption = "") +
  mapTheme()

## house rent
map4 <- ggplot() +
  geom_sf(data = ct, color = "black", size = 0.3) +
  geom_sf(data = joined_data, aes(fill = MedRent), color = "black", size = 0.3) +
  scale_fill_gradient(low = "lightblue", high = "darkblue", guide = "legend",
                      name = "Median Rent in Dollars", 
                      #breaks = seq(min(joined_data$MedHHInc), max(joined_data$MedHHInc), length.out = 5)
                      ) +
  #geom_sf(data= sd, na.rm = TRUE, aes(color = sale_price ), size = 3, alpha = 0.01) +
  labs(title = "Median House Rent by Census Tract",
       subtitle = "House rent in Philly",
       caption = "") +
  mapTheme()


## percent white population
map5 <- ggplot() +
  geom_sf(data = ct, color = "black", size = 0.3) +
  geom_sf(data = joined_data, aes(fill = pctWhite), color = "black", size = 0.3) +
  scale_fill_gradient(low = "lightblue", high = "darkblue", guide = "legend",
                      name = "Percent White Population", 
                      breaks = seq(min(joined_data$pctWhite), max(joined_data$pctWhite), length.out = 5)
                      ) +
  #geom_sf(data= sd, na.rm = TRUE, aes(color = sale_price ), size = 3, alpha = 0.01) +
  labs(title = "Percent White Population by Census Tract",
       subtitle = "House rent in Philly",
       caption = "") +
  mapTheme()

map1

map2

map3

map4

map5

3.7 Other Maps

Some other maps that would be helpful to our analysis include kernel density maps that visualize the density and spread of the various selected public amenities and other nearby facilities/neighbourhood features of Philadelphia. These include tree, hospital, farmer’s market, school, play area, grocery store, and crime spread.

Map 1: Tree the greatest density of trees exist across central Philadelphia, across center city and university city, as well as various parts of northwest Philadelphia. Trees are more scarce in other areas.

Map 2: Hospital Hospitals are not very well-distributes across Philadelphia, with the greatest density in center city, and few in other areas.

Map 3: Farmers’ Market Farmers’ markets also seem to concentrate in center city, as well as university city and northwest Philadelphia.

Map 4: School Schools are most evenly distributed across the city, while the northeast Philadelphia area has less schools spatially.

Map 5: Grocery Store Grocery stores are also similarly well-spread throughout, with certain pockets of concentration in the typically dense areas (center city, university city) as well as a particular node in North Philadelphia.

Map 6: Play streets Play streets are visibly more concentrated in center city, with other nodes of concentration across the city.

Map 7: Crime Crime (the crime identified in this set are those self-chosen as crimes that would most affect residential areas, i.e. violent or property crime) has a distinct concentration in South Philadelphia and parts of North Philadelphia.

## TREE DENSITY
treemap <- ggplot() + geom_sf(data = ct, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(tree)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "lightgreen", high = "darkgreen", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  labs(title = "Density of Trees, Philadelphia") +
  mapTheme()

## MED DENSITY
medmap <- ggplot() + geom_sf(data = ct, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(med)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.7, bins = 20, geom = 'polygon') +
  scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  labs(title = "Density of Hospitals, Philadelphia") +
  mapTheme()

## FARMERS MARKET DENSITY
fmarketmap <- ggplot() + geom_sf(data = ct, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(fmarket)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.7, bins = 20, geom = 'polygon') +
  scale_fill_gradient(low = "beige", high = "brown", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  labs(title = "Density of Farmers' Markets, Philadelphia") +
  mapTheme()

## SCHOOL DENSITY
schoolmap <- ggplot() + geom_sf(data = ct, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(school)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.8, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "lightyellow", high = "lightblue", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  labs(title = "Density of Schools, Philadelphia") +
  mapTheme()

## GROCER DENSITY
grocermap <- ggplot() + geom_sf(data = ct, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(grocer)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.7, bins = 20, geom = 'polygon') +
  scale_fill_gradient(low = "lightyellow", high = "lightgreen", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  labs(title = "Density of Grocers, Philadelphia") +
  mapTheme()

## PLAY AREA DENSITY
playmap <- ggplot() + geom_sf(data = ct, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(tree)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "lightgreen", high = "lightblue", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  labs(title = "Density of Play Areas, Philadelphia") +
  mapTheme()

## CRIME DENSITY
crimepts <- crime %>%
  filter(text_general_code %in% c("Aggravated Assault No Firearm", "Thefts", "Other Assaults", "Theft from Vehicle", "Aggravated Assault Firearm", "Burglary Residential", "Robbery No Firearm", "Robbery Firearm", "Rape", "Homicide - Criminal"),
         lat > -1) %>%
  dplyr::select(lat, lng) %>%
  na.omit() %>%
  st_as_sf(coords = c("lng", "lat"), crs = "EPSG:4326") %>%
  st_transform('ESRI:102286') %>%
  distinct()

crimemap <- ggplot() + geom_sf(data = ct, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(crimepts)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "#25CB10", high = "#FA7800", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  labs(title = "Density of Violent Crimes, Philadelphia") +
  mapTheme()

treemap

medmap

fmarketmap

schoolmap

grocermap

playmap

crimemap

4. Methods

Based on our data wrangling and data visualizations, we were able to select variables to best support a model to predict home value. Specifically, this model uses machine learning and geospatial analysis to create a hedonic model based on a range of variables. These variables are as previously described, including internal characteristics, public amenities/services, and other spatial information.

Among various statistical models abstracting reality, the Ordinary Linear Regression (OLS) model was chosen due to its relatively transparent process and computational efficiency. Initially, 70% of the data served as a ‘Training Set’ for building the predictive model, while the remaining 30% constituted the ‘Test Set’ for evaluating the accuracy and generalizability of the model. Through the evaluation, the test set was analyzed through various new variables (level of error, lag error, R-squared, and other variables).

These methods are explored and interpreted further in a step-by-step manner in the following section.

5. Results

5.1 Table of Results (Training Set)

As previously mentioned, our original dataset is split into a training and test set, with the training set building the model. This modelling training is is applied an OLS regression model. For internal characteristics of units, we used the number of bedrooms, bathrooms, stories, year built, exterior, interior, garage space, fireplaces, view type, and total area. For amenities and public service we used crime, school, grocery, heat, trees, playstreets, and hospitals. As part of spatial features we looked at percent white, median household income, and percent poverty. This training regression is pulled to get a summary table of each of the variables as well as the r-squared. This summary table shows the estimate values for the coefficients as well as the r-squared value. The R-squared value is about 0.57, indicating that approximately 57% of the variance of sale price can be explained by our model.

sdmerge.modelling <- sdmerge %>%
  filter(toPredict == "MODELLING")

set.seed(750)
inTrain <- createDataPartition(
              y = paste(sdmerge.modelling$sale_price
                        ), 
              p = .70, list = FALSE)
training <- sdmerge.modelling[inTrain,] 
test <- sdmerge.modelling[-inTrain,]  

reg.training <- 
  lm(sale_price ~ ., data = as.data.frame(training) %>% 
                             dplyr::select(sale_price,
                                           exterior_condition,
                                           garage_spaces,
                                           interior_condition, 
                                           MedRent,
                                           pctWhite,
                                           pctPoverty,
                                           number_of_bedrooms,
                                           number_of_bathrooms,
                                           number_stories,
                                           fireplaces,
                                           garage_spaces,
                                           view_type,
                                           year_built,
                                           total_area,
                                           crimebuff,
                                           crime_nn4,
                                           crime_nn5,
                                           playbuff,
                                           schoolbuff,
                                           school_nn2,
                                           school_nn3,
                                           school_nn5,
                                           medbuff,
                                           grocerbuff,
                                           treebuff,
                                           fmarketbuff,
                                           HVI_SCORE,
                                           ))

test <-
  test %>% #create predictions here, compare predictions to the observed
  mutate(Regression = "Baseline Regression",
         sale_price.Predict = predict(reg.training, test),
         sale_price.Error = sale_price.Predict - sale_price,
         sale_price.AbsError = abs(sale_price.Predict - sale_price),
         sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price.Predict)%>%
  filter(sale_price < 5000000) 
# 
# mean(test$sale_price.APE, na.rm = TRUE)
# mean(test$sale_price)
# mean(test$sale_price.AbsError, na.rm = TRUE)

tidyr.t <- tidy(reg.training)
r_squared <- summary(reg.training)$r.squared
tidyr.t <- rbind(tidyr.t, c("R-squared", NA, NA, NA, r_squared))
kable(tidyr.t, format = "html", digits = 4, caption = "Summary Table for Training Set") %>%
  kable_classic(full_width = T, html_font = "Helvetica", font_size = 15) %>%
  row_spec(0, bold = T) %>%
  footnote(general = "", fixed_small_size = T, general_title = "")
Summary Table for Training Set
term estimate std.error statistic p.value
(Intercept) -345427.576237817 87201.1290883935 -3.96127412395849 7.48551228104291e-05
exterior_condition -22464.3204990188 2280.8224051337 -9.84921949576427 7.95201409155221e-23
garage_spaces 21445.0170492749 1846.15603569271 11.616037125068 4.47694619138194e-31
interior_condition -27376.7537282187 2327.66129210213 -11.7614851529771 8.18917190512961e-32
MedRent 88.3085472141884 7.69777285965703 11.4719606338349 2.36040688662893e-30
pctWhite 71436.995176292 11047.5985833915 6.46629171372025 1.03175392846032e-10
pctPoverty -19795.163407382 15307.0737134964 -1.29320363760503 0.195958219843113
number_of_bedrooms -8928.06586540461 1405.74595219789 -6.35112329610164 2.19175899702311e-10
number_of_bathrooms 54200.6882615458 2832.14465853308 19.1376835566087 8.67277724331793e-81
number_stories 55469.0142181736 2561.34498284452 21.65620585657 1.29228892205375e-102
fireplaces 131977.050015673 4871.49752175237 27.0916795967491 2.79752026020309e-158
view_type0 30164.7663167493 39225.3064320629 0.76901289143512 0.441896346863053
view_typeA 42560.5610636493 34061.9525013105 1.24950444523143 0.211497840301431
view_typeB 22733.8446905374 44665.9572640299 0.508974755788907 0.610776528067202
view_typeC 137058.493672024 34449.3418189484 3.97855188039143 6.96239442162772e-05
view_typeD 1320.74763122881 38150.6467765912 0.0346192723537051 0.972383740258559
view_typeE 5175.73188579092 38303.2582857241 0.135125107299813 0.892514560088899
view_typeH 43697.3713165848 36831.7445584247 1.1864051469859 0.235478928947373
view_typeI 28241.8319466729 32914.2503607722 0.858042690844086 0.390880993879816
year_built 181.63042596967 41.0328751236527 4.42646111008126 9.63883824241568e-06
total_area 13.0919713374087 0.427334465341374 30.6363572312153 1.11090627608545e-200
crimebuff -513.734803477118 65.6222415318643 -7.82866893121387 5.21942684322719e-15
crime_nn4 -1370.77359094705 294.237347651186 -4.65873418819723 3.20571756978093e-06
crime_nn5 1580.60105777607 285.339657858229 5.53936690623423 3.08070891665012e-08
playbuff -435.669818952762 276.452786302405 -1.57592847871026 0.115060904345149
schoolbuff -1941.00471070437 1061.19115881317 -1.82908111755773 0.0674049745077005
school_nn2 60.4001103179244 29.7042248738397 2.03338449578997 0.0420292012746491
school_nn3 -82.1364671700063 44.9218890835185 -1.82842860898611 0.067502773298799
school_nn5 44.5936858264166 24.1702810863808 1.84498002596849 0.0650578358860428
medbuff 18077.5575981134 1675.27114253341 10.7908249232872 4.65499360238264e-27
grocerbuff 1588.25993391847 423.854848752827 3.7471788717101 0.000179426257512695
treebuff 120.486094704403 3.69088968185975 32.6441874696436 9.25138387252468e-227
fmarketbuff 306.297353615634 1351.17667003135 0.226689344487073 0.820668040361049
HVI_SCORE -1064.31695809294 996.774445125994 -1.06776108004897 0.285643468590333
R-squared NA NA NA 0.569180698634477

5.2 Table of goodness of fit (Test Set)

It is important to note that R-squared does not indicate whether the model itself is reliable, or whether the coefficient estimates and predictions are biased. Thus, the following table looks at the goodness of fit for the test set of the model, to better understand the predictive accuracy and performance of the model. Specifically, Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE). According to this table, the calculation of the MAE (mean absolute error) for this seed/iteration of the model is $77226.66, while the MAPE (mean absolute percentage error) is about 55%. These metrics offer insights into the model’s predictive accuracy and the degree of error in the predicted values. Given that lower values for MAE and MAPE are desirable, this means that our model is of higher error. This will be further explored in the following sections.

R = cor(test$sale_price, test$sale_price.Predict, method = "pearson", use = "complete.obs")
R2 = R*R
MAE = mean(test$sale_price.AbsError, na.rm = T)
MAPE = mean(test$sale_price.APE, na.rm = T)

testsum <- as.data.frame(cbind(R2, MAE, MAPE))

colnames(testsum)<-c("R Squared", "MAE", "MAPE")

kable(testsum, caption = "", align = "c") %>%
  kable_classic(full_width = T, html_font = "Helvetica", font_size = 15) %>%
  row_spec(0, bold = T) %>%
  footnote(general = "", fixed_small_size = T, general_title = "")
R Squared MAE MAPE
0.5583617 76894.81 0.556145

5.3 Cross-Validation MAE histogram

A cross-validation MAE histogram can allow us to look at the generalizability of the model over differnet iterations of sets of data. According to the histogram, the mean absolute error has a frequency curve that is highest at approximately $80k, which is very similar ot its mean value. This spread however is not symmetrical, with a somewhat wider spread and certain outliers (particularly towards the high end) meaning our model may not be as accurate.

fitControl <- trainControl(method = "cv", number = 100)

reg.cv <- 
  train(sale_price ~ ., data = st_drop_geometry(sdmerge.modelling) %>% 
                             dplyr::select(sale_price,
                                           exterior_condition,
                                           garage_spaces,
                                           interior_condition, 
                                           MedRent,
                                           pctWhite,
                                           pctPoverty,
                                           number_of_bedrooms,
                                           number_of_bathrooms,
                                           number_stories,
                                           fireplaces,
                                           garage_spaces,
                                           view_type,
                                           year_built,
                                           total_area,
                                           crimebuff,
                                           crime_nn4,
                                           crime_nn5,
                                           playbuff,
                                           schoolbuff,
                                           school_nn2,
                                           school_nn3,
                                           school_nn5,
                                           medbuff,
                                           grocerbuff,
                                           treebuff,
                                           fmarketbuff,
                                           HVI_SCORE
                                           ),
     method = "lm", trControl = fitControl, na.action = na.pass)


MAE.cv = data.table(MAE.cv= reg.cv$resample[,3])
MAE.sd <- as.character(round(sd(reg.cv$resample[,3])))
MAE.mean <- as.character(round(mean(reg.cv$resample[,3])))

caption <- paste("Sd =",MAE.sd, "  Mean =",MAE.mean, " K = 100")

ggplot(MAE.cv, aes(x = MAE.cv))+
  geom_histogram(fill=bluePalette5[3], color = greyPalette5[1], size = 2, xlim=c(20,50))+
  labs(title = "Cross-Validation MAE histogram",
       subtitle = caption,
       x="MAE",
       y="Count",
       caption = "")  +
  plotTheme()

5.4 Predicted prices as a function of observed prices

The following scatterplot plots predicted prices as a function of observed prices. According to this scatterplot, the model performs rather well, considering the line of best fit for the the points almost completely is overlapped by the line from the prediction model. However, it is still interesting to observe the various outliers that would result in high error, particularly those with a much higher observed price than predicted price. These points are visualized much higher than the prediction line.

set.seed(750)

sdmerge.all <- sdmerge %>%
  filter(toPredict %in% c("MODELLING", "CHALLENGE"))
 
reg.sdmerge.all <- 
  lm(sale_price ~ ., data = as.data.frame(sdmerge.all) %>% 
                             dplyr::select(sale_price,
                                           exterior_condition,
                                           garage_spaces,
                                           interior_condition, 
                                           MedRent,
                                           pctWhite,
                                           pctPoverty,
                                           number_of_bedrooms,
                                           number_of_bathrooms,
                                           number_stories,
                                           fireplaces,
                                           garage_spaces,
                                           view_type,
                                           year_built,
                                           total_area,
                                           crimebuff,
                                           crime_nn4,
                                           crime_nn5,
                                           playbuff,
                                           schoolbuff,
                                           school_nn2,
                                           school_nn3,
                                           school_nn5,
                                           medbuff,
                                           grocerbuff,
                                           treebuff,
                                           fmarketbuff,
                                           HVI_SCORE,
                                           ))

sdmerge.all <-
  sdmerge.all %>%
  mutate(Regression = "Baseline Regression",
         sale_price.Predict = predict(reg.sdmerge.all, sdmerge.all),
         sale_price.Error = sale_price.Predict - sale_price,
         sale_price.AbsError = abs(sale_price.Predict - sale_price),
         sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price.Predict)%>%
  filter(sale_price < 5000000) 

ggplot(sdmerge.all, aes(x = sale_price.Predict, y = sale_price)) +
  geom_point(size = 0.8, alpha = 0.8, color = "darkblue") +
  geom_abline(color = greyPalette5[4]) +
  geom_smooth(method = "lm", se = FALSE, colour = bluePalette5[5], size = 1) +
  labs(title = "Predicted prices as a function of Observed prices",
       subtitle = "Grey line: Line of Best Fit | Blue line: Prediction",
       x = "Predicted price",
       y = "Observed price",
       caption = "") +
  plotTheme()

5.5 Map of residuals with Moran’s I test

The following Map of Residuals of the test set of the model allows us to understand the error and accuracy of the model. In the map, points with a deeper blue shade are overvalued while very light shades are undervalued. Unfortunately, our model appears not to be performing well across different neighbourhoods in Philadelphia, with deep blue (overvaluing home value) in parts of Fishtown and South Philadelphia, and light blue (near white) more towards the northeast fringes. These spatial patterns will be examined through the Moran’s I test below.

ggplot() +
  geom_sf(data = ct, color = "black", size = 0.3) +
  geom_sf(data = test, aes(colour = q5(sale_price.Error)), alpha = 0.4, size = 0.5, show.legend = "point")+
  labs(title = "Map of Residuals (Test Set)", 
       subtitle = "",
       caption = '') +
  scale_colour_manual(values = bluePalette5,
                      labels=round(as.numeric(qBr(test,"sale_price.Error"))),
                      name="Residuals ($)",
                      na.translate=FALSE) +
  mapTheme()

The following Moran’s I test and chart visualize the spatial autocorrelation of the observations. The point of the test is to understand the similarity of home value spatially–that is, to those near a particular home. The line is slightly to the right of the average, suggesting that there is a moderate level of positive spatial autocorrelation or clustering, meaning certain homes close together may be similar in price, but this may not be the case across the board–especially for extreme values.

coords.test <-  st_coordinates(test)
neighborList.test <- knn2nb(knearneigh(coords.test, 5)) # k = 5, five nearest points

spatialWeights.test <- nb2listw(neighborList.test, style="W")

test$lagPrice <- lag.listw(spatialWeights.test, test$sale_price, NAOK = TRUE)
test$lagPriceError <- lag.listw(spatialWeights.test, test$sale_price.Error, NAOK = TRUE)

#imputing mean values!
missing_values <- which(is.na(test$sale_price.Error))
test$sale_price.Error[missing_values] <- mean(test$sale_price.Error, na.rm = TRUE)

moranTest <- moran.mc(test$sale_price.Error, spatialWeights.test, nsim = 999)

#saveRDS(moranTest, file = "moranTest.RDS")

MT <- ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  plotTheme()

SL1 <- ggplot(test, aes(x = lagPrice, y = sale_price))+
  geom_point(size =0.8, alpha = 0.8, color = bluePalette5[3])+
  geom_abline(color = greyPalette5[4])+
  geom_smooth(method = "lm", se = FALSE, colour = bluePalette5[4], size = 1)+
  labs(title = "Price vs Spatial Lag Price",
       subtitle = "Grey line : Perfect     Blue line : Average",
       x="Spatial Lag Price",
       y="Observed price",
       caption = " ") +
  plotTheme()
 

SL2 <- ggplot(test, aes(x = lagPriceError, y = sale_price.Error))+
  geom_point(size =0.8, alpha = 0.8, color = bluePalette5[3])+
  geom_abline(color = greyPalette5[4])+
  geom_smooth(method = "lm", se = FALSE, colour = bluePalette5[4], size = 1)+
  labs(title = "Price Error vs Spatial Lag Price Error",
       subtitle = "Grey line : Perfect     Blue line : Average",
       x="Spatial Lag Error",
       y="Error",
       caption = "") +
  plotTheme()


MT

Lastly, the following two charts show functions to their respective spatial lag value. The concept of price clustering is best visualized here. We see a rather strong positive relationship with the price value and the spatial lag price, meaning that an increase in the price results in an increase of the spatial lag price, representing nearby houses. This is similar for error as well, where greater price error results in greater price error for nearby houses.

grid.arrange(SL1, SL2, ncol=2)

5.6 Map of Predicted Values

The following maps visualizes all the sale price values across Philadelphia. The map on the right has predicted values compared to the map on the left with observed values. When looking at these two visualizations, it appears like the prediction somewhat matches that of the observed values–at least in terms of clustering of certain values in particular neighbourhoods.

PP <- ggplot() +
  geom_sf(data = ct, color = "black", size = 0.3) +
  geom_sf(data = sdmerge.all, aes(colour = q5(sale_price.Predict)), alpha = 0.4, size = 0.5, show.legend = "point")+
  labs(title = "Map of Predicted House Price", 
       subtitle = "Entire data set",
       caption = '')+
  scale_colour_manual(values = orangePalette5,
                      labels=round(as.numeric(qBr(sdmerge.all,"sale_price.Predict"))/1000),
                      name="Predicted\nPrice\n(1,000$)",
                      na.translate=FALSE)+
  mapTheme()

OP <- ggplot() +
  geom_sf(data = ct, color = "black", size = 0.3) +
  geom_sf(data = sdmerge.all, aes(colour = q5(sale_price)), alpha = 0.4, size = 0.5, show.legend = "point")+
  labs(title = "Map of Observed House Price", 
       subtitle = "Entire data set",
       caption = '')+
  scale_colour_manual(values = bluePalette5,
                      labels=round(as.numeric(qBr(sdmerge.all,"sale_price"))/1000),
                      name="Observed\nPrice\n(1,000$)",
                      na.translate=FALSE)+
  mapTheme()

grid.arrange(OP, PP, ncol = 2)

5.7 Map of MAPE by Census Tract

Using the test set predictions, the following map was created below to visualize mean absolute percentage error (MAPE) across Philadelphia. In this case, census tract was chosen as a geography instead as we believe it gave a more detailed spatial understanding of error.

However, it is important to note that many of the MAPE values were not included in the map because they were infinite and unable to be represented in the map. Only finite values for this column were included, leading to less data to visualize. Also, given that the values are mean values across the census tract, it is possible that the APE values for different homes in one census tracts were very high or very low, leading to a “cancelling out” effect where the MAPE appears lower than it should for the error that occurred in the model. This also could have lead to the infinite values that could not be mapped at all. Overall, this map visualizes MAPE across census tracts, however, may not be the most accurate way of visualizing the accuracy.

test.MAPE <- test %>%
  group_by(census_tract) %>%
  summarise(meanMAPE = mean(sale_price.APE), meanPrice = mean(sale_price)) 

test.MAPE <- st_join(sd_ct, test.MAPE, join = st_intersects)
test.MAPE <- test.MAPE[is.finite(test.MAPE$meanMAPE), ]

ggplot() +
  geom_sf(data = ct, color = "black", size = 0.3) +
  geom_sf(data = test.MAPE, aes(fill = meanMAPE), alpha = 0.4, size = 0.5)+
  scale_fill_gradient(low = "lightblue", high = "darkblue", guide = "legend",
                      name = "Mean Average Percent Error", 
                      breaks = seq(min(test.MAPE$meanMAPE), max(test.MAPE$meanMAPE), length.out = 5)) +
  labs(title = "Mean Average Percent Error across Census Tracts, Philadelphia",
       subtitle = "",
       caption = "") +
  mapTheme()

5.8 Scatterplot plot of MAPE by census tract as a function of mean price

The following scatterplot shows MAPE by census tract as a function of mean price by census tract. Similar to the map, this plot visualizes most of the available observations having very low MAPE values. It is very likely that the MAPE are not accurately represented in this scatterplot, resulting in a skewed diagram, given that the previous tables showed higher absolute error.

ggplot(test.MAPE , aes(x = meanMAPE, y = meanPrice))+
  geom_point()+
  geom_smooth(method = "lm", se = FALSE, colour = bluePalette5[4], size = 1)+
  labs(title = "Scatterplot of MAPE as a function of mean price",
       subtitle = "by census tract",
       x="MAPE",
       y="price",
       caption = "") +
  theme(axis.title.x = element_text(size = 10),
        axis.title.y = element_text(size = 10),
        text = element_text(size = 10),
        panel.background = element_rect(fill = greyPalette5[1]),
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5, face ="italic"),
        plot.caption=element_text(hjust = 0.5)
        )

5.9 Split by Census Groups

We split our study area into two groups with respect to poverty and income. In the poverty context map below, the blues correspond to areas with low poverty and white correspond to areas with high poverty. Also, in the income context map, the blues correspond to high income and whites correspond to low income. When we compare this to the map of predicted house prices, it is evident that the model accurately predicts for relatively high and relatively low values accurately. Thus, the model is generalizable for Philadelphia. For example, the Western edge of the city, where there is low poverty and high income has predicted house values which are quite high. Whereas, for the southwestern part of the city which has high poverty and low income, the model predicts lower house prices. This is of course based on the assumption that areas with low poverty and high income already correspond to higher observed house prices.

nhoods.split <- nhoods %>%
  mutate(povertyContext = ifelse(pctPoverty > .15, "High Poverty", "Low Poverty"),
         incomeContext = ifelse(MedHHInc > mean(MedHHInc,na.rm = T), "High", "Low"))

CT1 <- ggplot() +
  geom_sf(data = ct, color = "black", size = 0.3) +
  geom_sf(data = nhoods.split, aes(fill = povertyContext), color = greyPalette5[2], size = 0.2) +
  scale_fill_manual(values = c(bluePalette5[1],bluePalette5[4]),
                    name = "Poverty Context",
                    na.translate=FALSE) +
  labs(title = "Poverty Context", 
       subtitle = "Low Poverty Percentage : Percent of poverty < 15%",
       caption = '') +
  mapTheme()

CT2 <- ggplot() +
  geom_sf(data = ct, color = "black", size = 0.3) +
  geom_sf(data = nhoods.split, aes(fill = incomeContext), color = greyPalette5[2], size = 0.2) +
  scale_fill_manual(values = c(bluePalette5[4],bluePalette5[1]),
                    name = "Income Context",
                    na.translate=FALSE) +
  labs(title = "Income Context", 
       subtitle = "High Income : Income > Median Income",
       caption = '') +
  mapTheme()

grid.arrange(CT1, CT2, ncol = 2)

6. Discussion

This model is not the most effective and we would not recommend it to Zillow because of the high values of mean absolute errors and mean absolute percent error. This model does predict most values accurately but is poor in accounting for outliers. Some of the most interesting variables were demographic like median household income and percent poverty because Philadelphia is already a very polar city; areas very distinctly have very high or very low incomes and this very directly helps us predict for the city well. It was also interesting to see variables like heat and trees feed into the model to make it more accurate. This means that people have in the past and do very actively care about climatic factors and want to take that into consideration while buying a home.

We could predict variation in prices closest to the mean well but not for outliers. According to the scatterplot under 5.4, the model performs rather well, considering the line of best fit for the points almost completely is overlapped by the line from the prediction model. However, it is still interesting to observe the various outliers that would result in high error, particularly those with a much higher observed price than predicted price. Thus, the model does not predict outliers that well, suggesting it is not generalizable beyond its current observations.

The model does take into account very interesting factors and features that make it much more local and context based. Adding some of these has proven to improve the models fit like trees, crime and heat index, getting the accuracy of the model to 57%. This model thus performs well in generalizability but poorly in accuracy because of outliers.

According to our maps, the model performs very well in corresponding to areas with high or low observed house prices. The two maps are very aligned and it shows that the model accurately corresponds to different neighborhoods in Philadelphia.

7. Conclusion

Ultimately, we would not recommend our model to Zillow. despite testing with different variables in our model making process, our model still was only able to account for 57%. In order to make it better, we would have to add more local intel like percent home ownership, proximity to public transit, transit patterns of homeowners. This model can also be improved upon by accounting for external occurrences that cause anomalies, such as the effects of the Covid-19 pandemic, job losses in the recent economy and movement of citizens towards suburbs. This model could also do with more recent data rather than 2021 which makes it outdated in today’s context. This model however provides an excellent experimentation into local intel and should serve as a good base for Zillow to further build on.